A Hadoop-based vulnerability search engine that builds a distributed TF-IDF index over NVD CVE records and provides an interactive Java Swing interface for ranked CVE search.
The system uses Hadoop HDFS for distributed storage, MapReduce for offline indexing, and an in-memory Java search layer for interactive query serving. It was designed around NVD CVE data normalized into JSON Lines, where each line is one independent CVE document.
- Distributed storage of CVE JSONL datasets in HDFS
- MapReduce pipeline for tokenization, inverted index construction, and TF-IDF scoring
- Java Swing GUI with HDFS file management, MapReduce job monitoring, and CVE search
- AND/OR query mode, Top-N search, Show All Results, sortable result columns, and detailed raw JSON view
- Hadoop configuration and preprocessing scripts included for reproducibility
- src/: Java source code for MapReduce jobs, CLI search, and Swing GUI
- preprocessing/: Python scripts for converting NVD CVE JSON files to JSONL datasets
- hadoop-config/: Hadoop XML configuration files used in the 2-node VM cluster
- dist/: Built project JAR
- screenshots/: GUI, architecture, HDFS UI, and YARN UI screenshots
- datasets/: Dataset documentation only; large JSONL files are gitignored
- raw-nvd-json/: Raw NVD JSON documentation only; raw JSON files are gitignored
The technical report is included as BLM4821_CVE_Engine_Report.pdf.
The project uses NVD CVE data. Raw CVE JSON files were obtained from the community-maintained fkie-cad/nvd-json-data-feeds repository, which mirrors/reconstructs NVD JSON data feed packages from NVD data:
https://github.com/fkie-cad/nvd-json-data-feeds
The raw JSON files were normalized into JSON Lines format. One JSONL line corresponds to one CVE record and one searchable document. The implementation was tested on CVE data from 2018-2025, but the preprocessing and indexing pipeline can be applied to other NVD year ranges as well.
| Dataset | Size | Document count | HDFS raw path |
|---|---|---|---|
| Compact | 107.5 MB | 222,083 | /raw/cve_2018_2025_compact.jsonl |
| Enriched | 283.1 MB | 222,083 | /raw/cve_2018_2025_enriched.jsonl |
| Large | 523.6 MB | 222,083 | /raw/cve_2018_2025_500mb.jsonl |
Large raw files are intentionally excluded from GitHub. The data files under the local datasets/ and raw-nvd-json/ directories are gitignored because they range from hundreds of megabytes to more than one gigabyte. Use the scripts under preprocessing/ to recreate the JSONL datasets from raw NVD JSON files.
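The normalization step can be sketched as follows. This is a minimal illustration, not the scripts under preprocessing/: the input field names (`id`, `descriptions`, `lang`, `value`) and the two output fields are assumptions, and the sample records use placeholder IDs.

```python
import io
import json


def to_jsonl_line(record):
    """Flatten one CVE record (assumed layout) into a single JSON line."""
    descriptions = record.get("descriptions", [])
    # Keep the first English description, if any (assumed "lang"/"value" keys).
    text = next(
        (d.get("value", "") for d in descriptions if d.get("lang") == "en"), ""
    )
    # Collapse whitespace so embedded newlines cannot split the document.
    doc = {"id": record.get("id", ""), "description": " ".join(text.split())}
    return json.dumps(doc)


def write_jsonl(records, out):
    """Write one line per record and return the number of lines written."""
    count = 0
    for record in records:
        out.write(to_jsonl_line(record) + "\n")
        count += 1
    return count


# Placeholder records, not real NVD data.
sample = [
    {"id": "CVE-0000-0001",
     "descriptions": [{"lang": "en", "value": "Buffer overflow\nin example parser."}]},
    {"id": "CVE-0000-0002",
     "descriptions": [{"lang": "es", "value": "Texto"}]},
]
buffer = io.StringIO()
written = write_jsonl(sample, buffer)
lines = buffer.getvalue().splitlines()
```

Because every line is an independent JSON document, HDFS can split the file at line boundaries and each mapper can parse its lines without any surrounding context.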
- Host: Windows laptop running VirtualBox
- Guest OS: Ubuntu Server 22.04
- Hadoop: 3.4.1
- Java: 11
- Cluster layout:
  - hadoopmaster/192.168.56.101: NameNode, ResourceManager, DataNode, NodeManager
  - hadoop-worker1/192.168.56.102: DataNode, NodeManager
Monitoring interfaces used during development:
- HDFS NameNode UI: http://192.168.56.101:9870
- YARN ResourceManager UI: http://192.168.56.101:8088/cluster
The hadoop-config/ directory contains the cluster configuration used for this project:
- core-site.xml: default filesystem, e.g. hdfs://hadoopmaster:9000
- hdfs-site.xml: HDFS directories, replication, NameNode/DataNode settings
- mapred-site.xml: MapReduce configured to run on YARN
- yarn-site.xml: ResourceManager and NodeManager addresses
- hadoop-env.sh: Hadoop environment variables such as JAVA_HOME
- workers: worker node list
Configuration files were prepared on the master node and copied to the worker node using scp.
Reads raw JSONL CVE records, extracts relevant fields, normalizes text, and emits one tokenized document per CVE.
hadoop jar dist/cve-search.jar com.cvesearch.CveTokenizerJob \
/raw/cve_2018_2025_compact.jsonl /tokens/compact
Builds posting lists from tokenized documents.
hadoop jar dist/cve-search.jar com.cvesearch.InvertedIndexJob \
/tokens/compact /index/compact
Output format:
term -> CVE:tf,CVE:tf,...
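The tokenization and posting-list steps can be illustrated in plain Python. This is a single-process sketch, not the project's Java MapReduce code: the tokenizer rule (lowercase alphanumeric runs) and the function names are assumptions, and the sample documents use placeholder IDs.

```python
import re
from collections import Counter, defaultdict

# Assumed normalization: lowercase, keep runs of letters and digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Return the list of normalized tokens in a document."""
    return TOKEN_RE.findall(text.lower())


def build_postings(docs):
    """Map each term to {doc_id: term frequency}."""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            postings[term][doc_id] = tf
    return postings


def format_posting(term, entries):
    """Render one posting list in the 'term -> CVE:tf,CVE:tf,...' layout."""
    body = ",".join(f"{doc_id}:{tf}" for doc_id, tf in sorted(entries.items()))
    return f"{term} -> {body}"


# Placeholder documents, not real CVE records.
docs = {
    "CVE-0000-0001": "Apache HTTP server overflow. Overflow in parser.",
    "CVE-0000-0002": "Apache module allows remote code execution.",
}
postings = build_postings(docs)
print(format_posting("apache", postings["apache"]))
print(format_posting("overflow", postings["overflow"]))
```

In a MapReduce setting the mapper would emit (term, doc_id:tf) pairs and the reducer would concatenate them per term; the dictionary above plays the role of the shuffle and reduce phases.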
Computes TF-IDF scores for each term-document pair.
hadoop jar dist/cve-search.jar com.cvesearch.TfIdfJob \
/index/compact /tfidf/compact 222083
Formula:
TF-IDF = TF * log(N / DF)
where N is the document count and DF is the number of documents containing the term.
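A small worked example of the formula, using the document count from the dataset table. The natural logarithm and the TF and DF values are assumptions chosen for illustration; the source does not state the log base.

```python
import math


def tf_idf(tf, df, n_docs):
    """TF-IDF = TF * log(N / DF), using the natural log (assumed base)."""
    if df <= 0 or df > n_docs:
        raise ValueError("df must satisfy 0 < df <= n_docs")
    return tf * math.log(n_docs / df)


N = 222_083  # document count passed to TfIdfJob

# Hypothetical term statistics, not measured from the real index.
rare = tf_idf(2, 100, N)          # term found in few documents
common = tf_idf(2, 100_000, N)    # term found in many documents
everywhere = tf_idf(5, N, N)      # term found in every document
```

With the same TF, the rarer term receives the higher weight, and a term that appears in every document scores zero because log(N / N) = 0.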
The GUI loads the TF-IDF index and raw CVE JSONL records from HDFS into memory. Use a larger heap for enriched or large datasets.
java -Xmx2g -cp "dist/cve-search.jar:$(hadoop classpath)" com.cvesearch.CveSearchGUI
For the large dataset:
java -Xmx2500m -cp "dist/cve-search.jar:$(hadoop classpath)" com.cvesearch.CveSearchGUI
GUI modules:
- HDFS File Manager: browse HDFS, upload VM file to HDFS, download HDFS file to VM, delete, refresh
- MapReduce Job Monitor: run Tokenizer, Inverted Index, TF-IDF, or Full Pipeline
- Search Interface: load index, search CVEs, sort results, inspect detailed CVE records
Search does not run MapReduce. MapReduce is used only for offline index construction. At query time, the GUI reads the precomputed TF-IDF index from HDFS into memory, combines posting lists using AND/OR logic, ranks CVEs by accumulated TF-IDF score, and displays Top-N or all matches.
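The query-time logic described above can be sketched as follows. This is an illustration in Python rather than the GUI's Java code: the in-memory layout (term mapped to a dictionary of CVE ID and TF-IDF score), the `search` function, and the tie-breaking by CVE ID are assumptions, and the index contents are placeholders.

```python
def search(index, query_terms, mode="AND", top_n=None):
    """Rank documents by accumulated TF-IDF over the query terms.

    index: term -> {cve_id: tfidf score} (assumed in-memory layout).
    """
    postings = [index.get(term, {}) for term in query_terms]
    if not postings:
        return []
    doc_sets = [set(p) for p in postings]
    if mode == "AND":
        candidates = set.intersection(*doc_sets)  # must match every term
    elif mode == "OR":
        candidates = set.union(*doc_sets)         # may match any term
    else:
        raise ValueError("mode must be 'AND' or 'OR'")
    scores = {
        doc: sum(p.get(doc, 0.0) for p in postings) for doc in candidates
    }
    # Highest score first; ties broken by CVE ID for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked if top_n is None else ranked[:top_n]


# Placeholder index, not real scores.
index = {
    "apache": {"CVE-0000-0001": 1.5, "CVE-0000-0002": 0.5},
    "overflow": {"CVE-0000-0001": 2.0, "CVE-0000-0003": 3.0},
}
and_hits = search(index, ["apache", "overflow"], mode="AND")
or_hits = search(index, ["apache", "overflow"], mode="OR", top_n=2)
```

Passing `top_n=None` corresponds to Show All Results, while a positive `top_n` truncates the ranked list after sorting, so the match count can still be reported separately from the number of rows shown.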
Example search results:
| Dataset | Query | Mode | Display | Matches | Shown | Query time |
|---|---|---|---|---|---|---|
| Compact | apache | AND | Show All | 1910 | 1910 | 0.591 s |
| Large | apache | AND | Top-N 50 | 1934 | 50 | 0.571 s |
| Dataset | Tokenizer | Inverted Index | TF-IDF | Total |
|---|---|---|---|---|
| Compact 107.5 MB | 5m58.959s | 11m29.182s | 3m29.980s | 20m58.121s |
| Enriched 283.1 MB | 6m07.060s | 9m18.300s | 4m26.863s | 19m52.223s |
| Large 523.6 MB | 5m57.599s | 9m08.788s | 2m45.042s | 17m51.429s |
The runtime does not increase monotonically with raw dataset size because this small virtualized Hadoop cluster is affected by HDFS block placement, input splits, disk cache, JVM warm-up, current VM load, YARN scheduling overhead, and output characteristics.
GitHub has practical file and repository size limits. Raw NVD JSON files and generated JSONL datasets are excluded from version control. The repository keeps the source code, preprocessing scripts, Hadoop configuration, screenshots, built JAR, and report while documenting how to recreate the data locally.


